NUMA-Aware Relocation in ZGC
NUMA-aware relocation is a feature recently added to ZGC, one of the garbage collectors in the OpenJDK, and is set to be released in JDK 26 through the introduction of JDK-8359683. Building on the recent memory allocation overhaul (see “How ZGC allocates memory for the Java heap”), this feature further enhances NUMA support and optimization in ZGC. For those interested in the code changes, see the relevant GitHub pull request.
This post is a technical deep dive into how NUMA is incorporated into the Relocation phase of ZGC. We’ll start with some background on the Relocation phase, to provide context for both the domain and the previous non-NUMA implementation. From there, we’ll explore why NUMA matters for GC performance and walk through the details of how NUMA-aware relocation is implemented in practice.
The primary goal of NUMA-aware relocation is to keep objects close to the threads that access them. By optimizing memory locality, it minimizes slower cross-node memory accesses in favor of faster local accesses. Although this approach may slightly increase the workload for the garbage collector, it typically results in an improvement in overall application performance.
Background
Relocation Phase
A garbage collection cycle (GC cycle) in ZGC consists of a sequence of phases executed in order, and the Relocation phase is the final phase in a GC cycle.
Before the Relocation phase begins, the Marking phase identifies all live (reachable) objects. In ZGC, objects are placed within regions of the heap called pages. Note that ZGC pages are distinct from operating system (OS) memory pages. Pages are categorized into three size classes: Small, Medium, and Large. Only objects on Small and Medium pages are candidates for relocation, since Large pages contain just a single object and cannot be compacted further. Each page tracks its age, representing the number of GC cycles its objects have survived. Pages eventually reach the “Old” age if their objects survive through enough cycles.
Pages that are considered sparsely populated are selected to form the Relocation Set, which is sorted so that pages with the fewest live bytes come first. Objects on pages in the Relocation Set will be relocated during the Relocation phase into new pages in a compact manner, so that memory can be freed up for new allocations. Objects are moved to a target page with the same age as the page they were originally on.
During relocation, target pages are tracked in a lookup table indexed by page age, known as relocation targets. Ages range from Survivor 1 through Survivor 14, ending with Old.
- For Small pages: Each GC thread manages its own table of relocation targets.
- For Medium pages: All GC threads share a common table of relocation targets.
If the relocation target for a given age is full or missing, a new target page is allocated lazily, only when the next object needs to be relocated.
If the GC cannot allocate a new target page, which could be due to high memory pressure, an in-place relocation is performed. In this rare scenario, objects are moved within the same page. Although this is slower than relocating to a new page, it helps ensure progress during memory exhaustion. Upon completion, the page itself may be added to the set of available relocation targets.
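To make the lookup structure concrete, here is a minimal C++ sketch of an age-indexed relocation targets table with lazy target-page allocation. It illustrates the idea only and is not the HotSpot implementation; names such as ZPage and NumAges are invented for this example, and the 2 MB page size is just an assumption.

```cpp
#include <array>
#include <cstddef>

// Illustrative stand-in for a ZGC page: bump-pointer allocation within a
// fixed-size region. (Hypothetical; not the real ZPage.)
struct ZPage {
  explicit ZPage(size_t size) : _size(size), _top(0) {}

  // Returns the offset of the new object within the page, or -1 if it does not fit.
  ptrdiff_t allocate(size_t size) {
    if (_top + size > _size) {
      return -1;
    }
    ptrdiff_t offset = static_cast<ptrdiff_t>(_top);
    _top += size;
    return offset;
  }

  size_t _size;
  size_t _top;
};

constexpr int NumAges = 16;  // One bucket per page age (e.g. Survivor 1..14 and Old)

// Relocation targets indexed by age only (the pre-NUMA design).
class RelocationTargets {
public:
  // Copy destination for an object of the given age. A new target page is
  // allocated lazily, only when an object actually needs to be relocated
  // and the current target is full or missing.
  ZPage* target_for(int age, size_t object_size) {
    ZPage*& target = _targets[age];
    if (target == nullptr || target->allocate(object_size) < 0) {
      target = new ZPage(2 * 1024 * 1024);  // Assume a 2 MB Small page
      target->allocate(object_size);
    }
    return target;
  }

private:
  std::array<ZPage*, NumAges> _targets{};
};
```

In the real collector, each GC thread keeps its own table of this shape for Small pages, while Medium pages share a single table across all GC threads, as described above.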
Threads Relocating Objects
Most objects are relocated by GC threads, which iterate over pages in the relocation set and move objects one at a time. Because ZGC is a concurrent garbage collector, application threads, known as mutators, can potentially race with the GC when accessing objects that are scheduled for relocation. In these cases, the mutator thread assists by relocating the object itself before accessing it. We will consider the role of GC and mutator threads in more detail in #New Design.
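Because a GC thread and a mutator may race to relocate the same object, the copy has to be installed atomically so that exactly one of them wins. The sketch below shows that general pattern with a compare-and-swap on a per-object forwarding entry; it is a deliberately simplified illustration (ForwardingEntry, relocate_or_get, and the allocate_target callback are invented names), not ZGC's actual forwarding tables or load barriers.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One forwarding entry per to-be-relocated object; 0 means "not yet relocated".
// (Hypothetical structure for illustration only.)
struct ForwardingEntry {
  std::atomic<uintptr_t> to_addr{0};
};

// Called by both GC threads and mutators. Whoever installs the forwarding
// address first wins the race; the loser simply uses the winner's copy.
uintptr_t relocate_or_get(ForwardingEntry& fwd, const void* from, size_t size,
                          void* (*allocate_target)(size_t)) {
  uintptr_t seen = fwd.to_addr.load(std::memory_order_acquire);
  if (seen != 0) {
    return seen;  // Already relocated by someone else
  }
  void* copy = allocate_target(size);  // Caller picks the target page
  std::memcpy(copy, from, size);       // Speculatively copy the object
  uintptr_t expected = 0;
  if (fwd.to_addr.compare_exchange_strong(expected,
                                          reinterpret_cast<uintptr_t>(copy),
                                          std::memory_order_release,
                                          std::memory_order_acquire)) {
    return reinterpret_cast<uintptr_t>(copy);  // We won the race
  }
  // Another thread won; `expected` now holds its copy's address. A real
  // implementation would return or reuse our speculative allocation.
  return expected;
}
```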
Non-Uniform Memory Access (NUMA)
Much of the NUMA functionality in ZGC was redesigned in JDK-8350441, which you can read more about in “How ZGC allocates memory for the Java heap”. Historically, ZGC interleaved memory across all NUMA nodes when allocating pages. While this approach was straightforward and delivered consistently good average performance, the recent redesign enables more advanced strategies that have the potential to further increase performance.
In the new design, memory for a new page is preferably allocated on the NUMA node local to the thread performing the allocation. The same policy applies when allocating objects: they are preferably placed on a page whose memory is local to the NUMA node of the allocating thread. The goal is to leverage the fact that accessing local memory is significantly faster, often up to twice as fast as accessing remote memory.
Using the preferred policy does not guarantee local allocation. If the preferred node runs out of memory, the allocation will fall back to a different NUMA node so that the allocation may succeed. See MPOL_PREFERRED in the Linux kernel docs for more details on how the preferred policy works.
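To illustrate how a preferred placement can be requested from the kernel (this is not ZGC's code, just a minimal Linux sketch), the example below maps anonymous memory and applies MPOL_PREFERRED for one node via mbind; the kernel then places pages on that node when possible and falls back to other nodes otherwise. It assumes a Linux system with the libnuma development headers (compile with -lnuma) and a node id below 64.

```cpp
#include <numaif.h>     // mbind, MPOL_PREFERRED (link with -lnuma)
#include <sys/mman.h>   // mmap
#include <cstddef>
#include <cstdio>

// Reserve `size` bytes and ask the kernel to *prefer* placing its pages on
// `node`. Unlike a strict binding, MPOL_PREFERRED falls back to other nodes
// if the preferred node is out of memory, so the allocation still succeeds.
static void* reserve_preferring_node(size_t size, int node) {
  void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (addr == MAP_FAILED) {
    return nullptr;
  }
  unsigned long nodemask = 1UL << node;  // Assumes node < 64
  if (mbind(addr, size, MPOL_PREFERRED, &nodemask,
            sizeof(nodemask) * 8, 0) != 0) {
    perror("mbind");                     // Policy failed; memory is still usable
  }
  return addr;                           // Backing pages are placed on first touch
}

int main() {
  const size_t size = 2 * 1024 * 1024;   // E.g. one 2 MB ZGC Small page
  char* mem = static_cast<char*>(reserve_preferring_node(size, /*node=*/0));
  if (mem != nullptr) {
    mem[0] = 1;                          // First touch allocates on the preferred node
    std::printf("Reserved %zu bytes preferring node 0\n", size);
  }
  return 0;
}
```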
NUMA-Aware Relocation
Issues With Original Design
New objects are allocated on the NUMA node local to the allocating thread. This policy assumes that the object will continue to be used either by that thread or another thread on the same node, thereby benefiting from fast local memory. However, this assumption breaks down during relocation. When an object resides on a page selected as part of the relocation set, it may be migrated to a page on a different NUMA node. The relocation target lookup table (see #Relocation Phase) is indexed only by age, not by NUMA node. For Small pages, where each GC thread has its own lookup table, the object is migrated to the node local to the GC thread, which may differ from the original node, potentially losing locality. For Medium pages, the object could be migrated to any NUMA node in the system, also losing locality.
The loss of NUMA locality forces mutator threads to access memory from a remote node, which is roughly twice as slow, essentially undoing the very optimization we worked hard to implement by allocating memory locally.
New Design
Requirements
To address these issues, the new design should satisfy the following requirements:
- Preserve locality when possible: GC threads should strive to keep objects on their original NUMA node. Ideally, objects should be relocated to a page that resides on the same node.
- Respect the NUMA node of mutators: When a mutator helps relocate an object, it should place it on a page local to its own NUMA node, regardless of where the object was originally located. This updates the original assumption: after relocation, the object is expected to be accessed by the same mutator thread or another thread on that node. This policy is already applied for Small pages and requires no changes. For Medium pages, it is not currently implemented. However, because mutators rarely perform relocation and Medium pages are also relatively uncommon, the immediate performance benefit is limited. Adding NUMA-awareness here is left as a future enhancement.
- GC threads should strive to work on local memory: When the GC chooses a page from the relocation set, it should prefer pages local to its NUMA node. This improves performance by ensuring that GC threads primarily operate on local memory. In the current design, a single parallel iterator is shared by all GC threads, offering no way to select only NUMA-local pages.
NUMA-Aware Relocation Targets
To better control which NUMA node objects are placed on, we extend the original lookup table with an additional NUMA dimension. Instead of indexing solely by page age, the table now uses both NUMA id and page age. When a page is allocated, it is preferably placed on the NUMA node associated with its position in the table. As a result, objects can be relocated to specific NUMA nodes.
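Continuing the earlier sketch (still illustrative, with hypothetical names such as allocate_page_preferring_node), the table gains a NUMA dimension so that a target page is looked up by NUMA id and age and is preferably backed by memory on that node:

```cpp
#include <array>
#include <cstddef>

// Minimal stand-ins for illustration; see the earlier sketch for details.
struct ZPage { size_t size; size_t top; };

// Hypothetical helper: in a real implementation this would request memory
// that preferably resides on `numa_id` (e.g. via an MPOL_PREFERRED mapping).
static ZPage* allocate_page_preferring_node(int numa_id) {
  (void)numa_id;
  return new ZPage{2 * 1024 * 1024, 0};
}

constexpr int MaxNumaNodes = 8;   // Sized from the machine at startup (assumption)
constexpr int NumAges = 16;

// Relocation targets indexed by both NUMA id and age. Objects from a page on
// node N with age A are copied to _targets[N][A], which is preferably backed
// by memory on node N, so locality survives relocation.
class NumaRelocationTargets {
public:
  ZPage* target_for(int numa_id, int age) {
    ZPage*& target = _targets[numa_id][age];
    if (target == nullptr) {
      target = allocate_page_preferring_node(numa_id);  // Lazy, NUMA-preferred
    }
    return target;
  }

private:
  std::array<std::array<ZPage*, NumAges>, MaxNumaNodes> _targets{};
};
```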
NUMA-Aware Relocation Set Iterators
Originally, a single parallel iterator scanned all pages in the relocation set, without regard for which NUMA nodes the pages belonged to.
In the new design, each iterator is tied to a specific NUMA node and only selects pages from that node. This allows GC threads to prioritize relocating NUMA-local pages. To handle a GC thread migrating to another CPU core, which could potentially be on a different NUMA node, the NUMA node association is periodically polled so that the GC thread always selects pages from the most appropriate iterator. If the local iterator runs out of pages, the GC thread falls back to helping with remote pages from another iterator, maintaining balanced progress without sacrificing locality.
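A simplified sketch of this iteration scheme follows (hypothetical names, not the actual implementation): one iterator per NUMA node, with each GC thread polling its current node and falling back to other nodes' iterators once the local one is drained. It assumes a Linux system where glibc exposes getcpu() and that the iterator vector has one entry per NUMA node.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>    // getcpu (glibc)
#include <atomic>
#include <cstddef>
#include <vector>

struct ZPage;  // Opaque here

// Per-node view of the relocation set: a shared cursor over the pages that
// live on one NUMA node, safe for multiple GC threads to pull from.
class PerNodeIterator {
public:
  explicit PerNodeIterator(std::vector<ZPage*> pages)
      : _pages(std::move(pages)), _next(0) {}

  ZPage* next() {
    size_t i = _next.fetch_add(1, std::memory_order_relaxed);
    return i < _pages.size() ? _pages[i] : nullptr;  // nullptr: drained
  }

private:
  std::vector<ZPage*> _pages;       // Pages of the relocation set on this node
  std::atomic<size_t> _next;
};

// NUMA node of the CPU the thread currently runs on; polled periodically
// because the scheduler may migrate the thread to another node.
static int current_numa_node() {
  unsigned int cpu = 0, node = 0;
  getcpu(&cpu, &node);
  return static_cast<int>(node);
}

// Each GC thread prefers its local iterator and only helps with remote
// nodes once the local one runs out of pages.
ZPage* select_next_page(std::vector<PerNodeIterator>& iters) {
  const int local = current_numa_node();
  if (ZPage* page = iters[local].next()) {
    return page;                                   // NUMA-local page
  }
  for (size_t node = 0; node < iters.size(); node++) {
    if (static_cast<int>(node) == local) continue;
    if (ZPage* page = iters[node].next()) {
      return page;                                 // Remote fallback
    }
  }
  return nullptr;                                  // Relocation set exhausted
}
```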
An alternative approach would be to create separate relocation sets for each NUMA node. With this design, iterators would only scan pages from their local set, avoiding the need to filter through a global set that contains pages from all nodes. We chose not to pursue this approach because our current design achieves the same effect with less complexity: it requires no changes to how relocation sets are created or sorted, making it simpler to implement.
Discussion
- Performance impact on GC threads: In the old design for Small pages, objects were always relocated to the NUMA node local to the GC thread performing the relocation. This had a positive impact on relocation speed, since GC threads accessed local memory for at least the target page. In the new design, objects are relocated to the NUMA node where they were originally allocated. This can slow down a GC thread if those pages are remote, as it may need to access memory on a different NUMA node for both the source and target pages. The impact is mitigated by multiple NUMA-aware iterators, which allow GC threads to preferentially select pages local to their node. To make this effect measurable, we've added logging that reports the percentage of pages relocated NUMA-locally by each GC thread. This can be enabled with -Xlog:gc+reloc+numa=debug; note that the log format may change in future versions.

```
[69.533s][debug][gc,reloc,numa] Pages relocated NUMA-locally: 800 / 826 (97%)
[60.939s][debug][gc,reloc,numa] Pages relocated NUMA-locally: 933 / 936 (100%)
[74.125s][debug][gc,reloc,numa] Pages relocated NUMA-locally: 936 / 949 (99%)
[78.789s][debug][gc,reloc,numa] Pages relocated NUMA-locally: 907 / 915 (99%)
```

- Memory overhead: Making the Relocation phase NUMA-aware introduces additional overhead to track relocation targets, and this overhead scales with the number of NUMA nodes. As reported in JDK-8366476, a machine with 8 NUMA nodes may fail to run ZGC on particularly small heaps. Because NUMA is enabled by default, this means that ZGC may not work out-of-the-box on all NUMA machines with small heaps. However, NUMA optimizations can be disabled using -XX:-UseNUMA, reducing the overhead to non-NUMA levels. NUMA is primarily an optimization for large machines and large heaps, so disabling it for small heaps is a reasonable trade-off.
- Relocation set ordering trade-offs: As mentioned in #Relocation Phase, the relocation set is sorted so that pages with the fewest live bytes come first. This ordering helps reclaim memory quickly, which is especially valuable when the machine is under memory pressure. With NUMA-aware iterators, however, this global ordering can no longer be guaranteed: a GC thread may skip over pages with low live bytes on other nodes if it is scheduled on a different NUMA node. In practice, this trade-off is unlikely to be significant. Pages relocated from local memory should complete roughly twice as fast as pages from remote memory, offsetting the fact that they may not always be the pages with the lowest live bytes in the system.
- Impact on non-NUMA: Adding NUMA-awareness to the Relocation phase is purely an optimization for NUMA-enabled systems. When NUMA is not used, either on machines without NUMA support or when disabled via -XX:-UseNUMA, the GC behaves identically to the previous design, with no impact on performance or correctness. This ensures that the optimization is safe and transparent for workloads where NUMA is irrelevant, while providing potential benefits on large multi-socket machines.
In summary, NUMA-aware relocation preserves memory locality on large multi-socket machines while remaining transparent and safe on systems without NUMA, with minimal overhead for most workloads. By keeping objects close to the threads that use them, this approach typically improves throughput and latency on NUMA systems, potentially resulting in better overall application performance.
Future Enhancements
Rather than relying on heuristics to guess which CPU cores GC threads will run on, we can leverage functionality from libnuma to explicitly set thread affinity to cores on specific NUMA nodes. This would help align GC thread distribution with the distribution of pages in the relocation set, improving both locality and performance. Two of the considerations discussed above in #Discussion could be mitigated by introducing NUMA thread affinity. For details on how this could be implemented, see numa_node_to_cpus and numa_sched_setaffinity in libnuma, which can be used to fetch the CPU-to-NUMA-node mapping and to set thread affinity on Linux.
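For instance, a GC thread could be pinned to the CPUs of a particular NUMA node roughly as follows (a Linux sketch using libnuma, compiled with -lnuma; the helper name is made up and error handling is kept minimal):

```cpp
#include <numa.h>   // numa_available, numa_allocate_cpumask, numa_node_to_cpus,
                    // numa_sched_setaffinity (link with -lnuma)
#include <cstdio>

// Pin the calling thread to the CPUs belonging to `node`. Returns true on success.
static bool bind_current_thread_to_node(int node) {
  if (numa_available() < 0) {
    return false;                              // No NUMA support on this system
  }
  struct bitmask* cpus = numa_allocate_cpumask();
  if (numa_node_to_cpus(node, cpus) != 0) {    // CPUs local to this NUMA node
    numa_free_cpumask(cpus);
    return false;
  }
  // pid 0 means "the calling thread" for sched_setaffinity-style calls.
  int rc = numa_sched_setaffinity(0, cpus);
  numa_free_cpumask(cpus);
  return rc == 0;
}

int main() {
  if (bind_current_thread_to_node(0)) {
    std::printf("Thread bound to the CPUs of NUMA node 0\n");
  }
  return 0;
}
```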
In addition to the Relocation phase, the Marking phase could also be made NUMA-aware. Unlike Relocation, Marking involves traversing the object graph from a set of known roots to identify live objects. A natural first step for NUMA-awareness is to ensure that GC threads primarily operate on pages that are local to their own NUMA node, maximizing memory access speed.
Additional Notes
- There are cases when an allocation might be spread out across multiple NUMA nodes, but this is rare and only used as a last resort. When and how this occurs is detailed in "How ZGC allocates memory for the Java heap".
- On a typical 2-socket NUMA machine, accessing memory on a remote node can be roughly twice as slow as local memory. The relative distances between NUMA nodes can be seen using numactl:

```
$ numactl --hardware
...
node distances:
node   0   1
  0:  10  20
  1:  20  10
```

Here, the distance values indicate relative latency: lower numbers mean faster access. Node 0 accessing memory on node 1 (distance 20) is roughly twice as slow as accessing local memory (distance 10).